-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0) #61909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…or string dtype (empty string instead of 0)
pandas/_libs/groupby.pyx
Outdated
@@ -729,6 +730,10 @@ def group_sum( | |||
sumx = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype) | |||
compensation = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype) | |||
|
|||
if is_string: | |||
# for strings start with empty string instead of 0 as `initial` value | |||
sumx = np.full((<object>out).shape, "", dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would passing “initial” be more general/idiomatic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially thinking that as well, but then because this would in practice only be used for the specific case of strings, I thought to be more explicit about that fact. But in both cases I have to pass it down a few layers, so whether it is then called initial or is_string actually does not matter much, and initial
at least makes it clearer where it is called from the EA _groupby_ops what the purpose is of passing it down.
So updated to use initial
in the last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
I ran into one more case of the sum of empty / all-NaN to use "0" instead of empty string (#60229), specifically when effectively introducing empty groups with categorical data with observed=False.
Follow-up on #60936
The passing through of
is_string
through several layers is a bit annoying, but effectively is needed to for now only changes this for string dtype, and not for object dtype in general (which in the other PR related to this, we did for now)